Add regression tests for ordered array_agg after unnest with window functions (issue #20788)#21472
Draft
kosiew wants to merge 8 commits intoapache:mainfrom
Draft
Add regression tests for ordered array_agg after unnest with window functions (issue #20788)#21472kosiew wants to merge 8 commits intoapache:mainfrom
array_agg after unnest with window functions (issue #20788)#21472kosiew wants to merge 8 commits intoapache:mainfrom
Conversation
Verify query result for unnest -> row_number -> group by -> array_agg with ordered output. Ensure physical plan includes UnnestExec, BoundedWindowAggExec, and ordered array_agg path.
Introduce an ordered EXPLAIN regression case next to the existing unordered coverage in unnest.slt. Add a comment clarifying why unordered array_agg does not suffice for issue 20788. Update the Rust plan assertion in basic.rs to be less brittle by verifying the aggregate shape instead of relying on the exact alias-qualified ORDER BY expression text.
Inline single-use dataframe binding and reduce repetition in assert_contains! calls using a loop. Maintain all public interfaces unchanged.
Rebuild arrays based on original element position rather than value order. Update basic.rs and unnest.slt by replacing synthetic ROW_NUMBER() with a paired unnest for ordinal generation. Switch input rows to explicit row_idx literals to eliminate dependency on ROW_NUMBER(). Add a correctness query to the SLT case in addition to explain coverage.
Reintroduce explicit row ids, unnest, and generated original element ordinals. Implement ROW_NUMBER() over partitioned indices for accurate ordering. Adjust Rust-side plan assertion for improved stability by focusing on higher-signal operators and sort shape instead of exact normalized casts.
Extract shared array_agg_after_unnest_sql helper. Enhance ordered_array_agg tests to compare plan shapes. Add ordered_array_agg_after_unnest_explain_analyze_metrics to verify runtime metrics and memory usage. Adjust unordered test assertions to reflect optimizer behavior and demonstrate efficiency of unordered aggregation path.
…der handling - Updated the `array_agg_after_unnest_sql` function to accept an `ArrayAggOrder` enum instead of a boolean to improve readability. - Enhanced code clarity by explicitly defining ordered and unordered aggregation cases. - Updated usage and related SQL execution calls in tests for better consistency.
Move ArrayAggOrder SQL fragment selection to enum method. Extract common functionality for SessionContext, EXPLAIN ANALYZE wrapping, physical-plan generation, and grouped substring assertions. Simplify regression tests by removing repeated plan-building and assertion boilerplate.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
This PR introduces targeted regression coverage for a query pattern involving
unnest,row_numberwindow functions, andarray_aggwithORDER BY. This pattern is known to trigger a different execution path in DataFusion compared to unorderedarray_agg, potentially leading to higher memory usage.Existing tests only covered the unordered variant of
array_agg, which uses the optimizedGroupsAccumulatorfast path. However, whenORDER BYis introduced insidearray_agg, DataFusion switches to an order-sensitive accumulation strategy that does not use this fast path.This change ensures that:
array_aggis explicitly captured and validated.What changes are included in this PR?
Added a new Rust regression test (
ordered_array_agg_after_unnest_regression) that:Validates correctness of ordered
array_aggafterunnestand window functions.Asserts expected physical plan structure, including:
UnnestExecBoundedWindowAggExecSortExecAggregateExecinSinglePartitionedmodeVerifies that ordered aggregation avoids partial/final aggregation modes.
Added a comparison with the unordered variant to highlight execution differences:
PartialandFinalPartitionedaggregation for unordered case.Added an
EXPLAIN ANALYZEtest (ordered_array_agg_after_unnest_explain_analyze_metrics) that:Extended
sqllogictestcoverage (unnest.slt) with:EXPLAINtest showcasing the logical and physical plan for orderedarray_agg.array_aggis insufficient coverage.Introduced helper utilities in tests:
ArrayAggOrderenum to toggle ordered vs unordered queries.Are these changes tested?
Yes.
The PR includes multiple layers of test coverage:
Correctness Tests
array_aggpreserves element ordering afterunnest.Plan Shape Assertions
Explain Plan Tests (SQLLogicTest)
Execution Metrics Validation
EXPLAIN ANALYZEto assert row counts and memory-related metrics.These tests collectively ensure both functional correctness and execution stability.
Are there any user-facing changes?
No.
This PR only adds test coverage and documentation. There are no changes to public APIs or user-facing behavior.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.